For all R users, CRAN is the place where you can find most of the packages you need to run your code. Last month, I saw this xkcd drawing:

And it reminded me of something: behind packages, there are people. These people are at the heart of all your analytics pipelines, because they build, maintain and develop R packages. It’s their ideas, their work, their discussions and their collaborations that made R what it is. One example is the fantastic story of the pipe (%>%) by Adolfo Alvarez.

Let’s talk a bit about these people and their packages!

Which packages are we talking about?

Not all packages on CRAN are equal. They are all useful, and they all required a lot of work. But some are used more widely than others, often because they are more general.

I tried to assess this by comparing two things: how often a package is downloaded, and how it is connected to other packages through dependencies.

For the first one, I used {cranlogs}, a package which shows the download statistics for the RStudio CRAN mirror, from 2013 to now. It does not cover all CRAN downloads, but it is a good part of them. I chose to compute the median monthly downloads for each package since its first appearance (so only over months with more than 0 downloads). I used the median because some packages show strange statistics that I couldn’t explain, like {tidyverse} with an incredible number of downloads in November 2018, or {aws.s3} with a large number of downloads over the last year.
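To make the median idea concrete, here is a minimal sketch on made-up numbers. It is not the code used for this post (the analysis itself was done in R with {cranlogs}); the function name `median_monthly_downloads` and the toy counts are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import median


def median_monthly_downloads(daily_counts):
    """Median of monthly download totals, ignoring months with no downloads.

    daily_counts: iterable of ("YYYY-MM-DD", count) pairs for one package
    (a hypothetical input shape, chosen for this sketch).
    """
    monthly = defaultdict(int)
    for day, count in daily_counts:
        monthly[day[:7]] += count  # group by "YYYY-MM"
    # Keep only months with more than 0 downloads
    active = [total for total in monthly.values() if total > 0]
    return median(active) if active else 0


# Toy data: one month with an abnormal spike
toy = [
    ("2018-09-15", 0),
    ("2018-10-01", 10), ("2018-10-20", 30),
    ("2018-11-05", 5000),
    ("2018-12-03", 60),
]
result = median_monthly_downloads(toy)
print(result)
```

On this toy data the monthly totals are 40, 5000 and 60, so the median is 60, while the mean would be pulled up to 1700 by the single spike. That is the reason for preferring the median here.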

For the dependency count, I used some centrality measures from network theory. What matters to me is how many times a package is listed as a dependency (in-degree centrality), how many dependencies it has (out-degree centrality), and whether the package is “important” in the overall dependency graph. For this last measure, I used PageRank centrality, the algorithm Google is known for.
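The three measures can be illustrated on a tiny, invented dependency graph. This is only a sketch under assumptions: the package names are hypothetical, PageRank is computed with a plain power iteration and the usual damping factor of 0.85, and a graph library would normally do this for you.

```python
def centralities(depends_on, damping=0.85, iterations=100):
    """Return (in_degree, out_degree, pagerank) for a dependency graph.

    depends_on maps each package to the list of packages it depends on,
    so an edge goes from a package to its dependency.
    """
    nodes = sorted(set(depends_on) | {d for deps in depends_on.values() for d in deps})
    n = len(nodes)
    out_degree = {p: len(depends_on.get(p, [])) for p in nodes}
    in_degree = {p: 0 for p in nodes}
    for deps in depends_on.values():
        for d in deps:
            in_degree[d] += 1

    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Rank held by packages without dependencies is spread uniformly
        dangling = sum(rank[p] for p in nodes if out_degree[p] == 0)
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in nodes}
        for pkg, deps in depends_on.items():
            for d in deps:
                new_rank[d] += damping * rank[pkg] / len(deps)
        rank = new_rank
    return in_degree, out_degree, rank


# Hypothetical packages, for illustration only
toy = {
    "pkgA": ["core"],
    "pkgB": ["core", "pkgA"],
    "pkgC": ["pkgA"],
    "core": [],
}
in_degree, out_degree, rank = centralities(toy)
for pkg in sorted(rank, key=rank.get, reverse=True):
    print(pkg, in_degree[pkg], out_degree[pkg], round(rank[pkg], 3))
```

In this toy graph, `core` and `pkgA` are both listed as a dependency twice, yet PageRank puts `core` first, because `pkgA` itself depends on `core` and passes its importance along. That is the extra information PageRank adds on top of the in-degree.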

With all of that, we get the following table with all the packages on CRAN, ranked by PageRank centrality:

I gathered the main information in a plot, crossing centrality (in-degree) and popularity (downloads). I annotated some main areas:

And the authors?

Talking about packages is nice. But behind them are people. A lot of people. Of the top packages shown before, 59% are multi-author packages (I counted all roles, such as author or creator). Some have a lot of contributors, like the {ape} package. Note that heterogeneity in how roles are declared can make it difficult to compare across packages.

R packages are a matter of collaboration, since many of them are multi-authored. We can link people together based on the packages they worked on. As an example, here is the graph of all the people who participated in the “main” packages cited above.
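Linking people through shared packages amounts to building a co-authorship network. The sketch below shows one possible way to derive its edges, along with the share of multi-author packages, from a package-to-authors mapping. The names and data are invented, and this is not the code behind the graph in this post.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(package_authors):
    """Count, for each pair of people, how many packages they share."""
    edges = Counter()
    for authors in package_authors.values():
        # Every pair of distinct people on the same package gets a link
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical packages and people, for illustration only
toy = {
    "pkgA": ["Alice", "Bob"],
    "pkgB": ["Alice", "Bob", "Carol"],
    "pkgC": ["Dave"],
}
edges = coauthor_edges(toy)
multi_author_share = sum(len(set(a)) > 1 for a in toy.values()) / len(toy)

for pair, weight in sorted(edges.items()):
    print(pair, weight)
print(round(multi_author_share, 2))
```

Here Alice and Bob share two packages, so their link has weight 2, while Dave, who works alone, has no link at all. Edge weights like these are what make tightly collaborating groups stand out in a network plot.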

The impression here is that the network is deeply interconnected! I highlighted some groups who work a lot together, like RStudio employees or the R Core Team. But they are not isolated from one another. In fact, there are links between them.

To conclude

What can we learn from this little work? First, collaboration is important in the R world! People are designing packages together.